First, read in the NYPD Shooting incident data. The CSV file can be downloaded from https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic.
You will need the tidyverse, lubridate, and plotly packages. Install them once if necessary, then load them:
install.packages("tidyverse")
install.packages("lubridate")
install.packages("plotly")
library(tidyverse)
library(lubridate)
library(plotly)
shooting_data <- read_csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD")
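Now drop INCIDENT_KEY and all columns after VIC_RACE, and convert OCCUR_DATE to a date data type.
shooting_data <- shooting_data %>%
  # Keep only the columns from OCCUR_DATE through VIC_RACE
  select(OCCUR_DATE:VIC_RACE) %>%
  # Parse the month/day/year strings into Date values
  mutate(OCCUR_DATE = mdy(OCCUR_DATE))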
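To see what the column selection and date conversion do in isolation, here is a minimal sketch on a made-up table. The `toy` tibble and its values are invented for illustration; only the column names mirror the dataset.

```r
library(dplyr)
library(lubridate)

# Toy data (invented values); column names mirror the real dataset.
toy <- tibble(
  INCIDENT_KEY = c(1, 2),
  OCCUR_DATE   = c("01/31/2006", "12/05/2020"),
  BORO         = c("BRONX", "QUEENS"),
  VIC_RACE     = c("UNKNOWN", "UNKNOWN"),
  X_COORD_CD   = c(1000, 2000)
)

toy_clean <- toy %>%
  # Keep the contiguous block of columns from OCCUR_DATE through VIC_RACE
  select(OCCUR_DATE:VIC_RACE) %>%
  # Parse month/day/year strings into Date values
  mutate(OCCUR_DATE = mdy(OCCUR_DATE))

print(toy_clean)
```

The range `OCCUR_DATE:VIC_RACE` selects by column position, so it drops everything before OCCUR_DATE and everything after VIC_RACE in a single step.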
Show a summary of the data:
summary(shooting_data)
## OCCUR_DATE OCCUR_TIME BORO PRECINCT
## Min. :2006-01-01 Length:23568 Length:23568 Min. : 1.00
## 1st Qu.:2008-12-30 Class1:hms Class :character 1st Qu.: 44.00
## Median :2012-02-26 Class2:difftime Mode :character Median : 69.00
## Mean :2012-10-03 Mode :numeric Mean : 66.21
## 3rd Qu.:2016-02-28 3rd Qu.: 81.00
## Max. :2020-12-31 Max. :123.00
##
## JURISDICTION_CODE LOCATION_DESC STATISTICAL_MURDER_FLAG
## Min. :0.0000 Length:23568 Mode :logical
## 1st Qu.:0.0000 Class :character FALSE:19080
## Median :0.0000 Mode :character TRUE :4488
## Mean :0.3323
## 3rd Qu.:0.0000
## Max. :2.0000
## NA's :2
## PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## Length:23568 Length:23568 Length:23568 Length:23568
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## VIC_SEX VIC_RACE
## Length:23568 Length:23568
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The visualizations I will be using do not require any filtering of missing values, but if they did, I could do it with:
shooting_data_no_missing <- shooting_data %>%
  # read_csv() reads empty and "NA" fields as missing (NA), so test with is.na()
  filter(
    !is.na(PERP_AGE_GROUP), PERP_AGE_GROUP != "UNKNOWN",
    !is.na(PERP_SEX), !is.na(PERP_RACE),
    !is.na(VIC_AGE_GROUP), VIC_AGE_GROUP != "UNKNOWN",
    !is.na(VIC_SEX), !is.na(VIC_RACE)
  )
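A short note on how missing values behave inside `filter()`, since it affects how a condition like the one above should be read. The sketch below uses an invented one-column tibble; it shows that comparing a missing value with a string yields `NA`, and that `filter()` drops such rows, so an explicit `is.na()` test selects the same rows while stating the intent directly.

```r
library(dplyr)

# Toy data (invented): one real value, one "UNKNOWN", one missing.
toy <- tibble(PERP_AGE_GROUP = c("18-24", "UNKNOWN", NA))

# Comparing NA with a string yields NA, not TRUE or FALSE
comparison <- toy$PERP_AGE_GROUP != "UNKNOWN"
print(comparison)

# filter() keeps only rows where the condition is TRUE, so the NA row is dropped
kept_implicit <- toy %>%
  filter(PERP_AGE_GROUP != "UNKNOWN")

# An explicit is.na() test makes the intent visible and keeps the same rows
kept_explicit <- toy %>%
  filter(!is.na(PERP_AGE_GROUP), PERP_AGE_GROUP != "UNKNOWN")
```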
Group the data by month for both murders and shootings for the first visualization
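The grouping code for this step is not shown in this text, so the following is only a sketch of one way it could be done, assuming "month" means calendar month pooled across years. The helper `count_by_month`, the column `occur_month`, and the toy rows are hypothetical names and values, not taken from the original analysis.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper (an assumption, not the original code): count shootings
# and murders for each calendar month, pooled across all years.
count_by_month <- function(data) {
  data %>%
    mutate(occur_month = month(OCCUR_DATE, label = TRUE)) %>%
    group_by(occur_month) %>%
    summarize(
      shootings = n(),                          # one row per shooting incident
      murders   = sum(STATISTICAL_MURDER_FLAG), # TRUE counts as 1
      .groups   = "drop"
    )
}

# Toy rows (invented) standing in for shooting_data
toy <- tibble(
  OCCUR_DATE              = as.Date(c("2019-07-04", "2020-07-10", "2020-12-25")),
  STATISTICAL_MURDER_FLAG = c(TRUE, FALSE, FALSE)
)

month_group_toy <- count_by_month(toy)
print(month_group_toy)
```

On the real data the call would be `count_by_month(shooting_data)`, assuming the columns have been prepared as in the earlier steps.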
First visualization - Shootings Each Month
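The chart itself is not reproduced in this text. As a rough sketch of what a monthly bar chart could look like, the example below plots invented counts with ggplot2; the data frame `monthly_toy`, its values, and the choice of `geom_col()` are all assumptions rather than the original figure.

```r
library(ggplot2)

# Invented monthly counts standing in for the grouped data
monthly_toy <- data.frame(
  occur_month = factor(month.abb, levels = month.abb),
  shootings   = c(120, 100, 130, 150, 190, 220, 250, 240, 200, 170, 140, 130)
)

# One bar per calendar month; the factor levels keep the months in order
p <- ggplot(monthly_toy, aes(x = occur_month, y = shootings)) +
  geom_col() +
  labs(title = "Shootings Each Month", x = "Month", y = "Shootings")
```

Calling `print(p)` renders the chart. Since plotly is loaded earlier, `ggplotly(p)` would be one way to make it interactive.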
Group the data by borough and year for murder and shootings for the second visualization
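The code that builds `boro_group` is not shown in this text. Judging from how it is used afterwards, it has at least the columns `BORO`, `year`, and `shootings`; the sketch below is one way such a table could be produced. The helper `count_by_boro_year`, the `murders` column, and the toy rows are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper (an assumption, not the original code): count shootings
# and murders for each borough and calendar year.
count_by_boro_year <- function(data) {
  data %>%
    mutate(year = year(OCCUR_DATE)) %>%
    group_by(BORO, year) %>%
    summarize(
      shootings = n(),                          # one row per shooting incident
      murders   = sum(STATISTICAL_MURDER_FLAG), # TRUE counts as 1
      .groups   = "drop"
    )
}

# Toy rows (invented) standing in for shooting_data
toy <- tibble(
  OCCUR_DATE = as.Date(c("2010-03-01", "2010-08-15", "2011-01-20", "2010-05-05")),
  BORO       = c("BRONX", "BRONX", "BRONX", "QUEENS"),
  STATISTICAL_MURDER_FLAG = c(TRUE, FALSE, FALSE, TRUE)
)

boro_group_toy <- count_by_boro_year(toy)
print(boro_group_toy)
```

On the real data this would be `boro_group <- count_by_boro_year(shooting_data)`, assuming the columns have been prepared as in the earlier steps.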
Sort the 5 boroughs by shooting count since 2010
boro_group %>%
  filter(year >= 2010) %>%
  group_by(BORO) %>%
  summarize(shootings = sum(shootings)) %>%
  slice_max(shootings, n = 5)
## # A tibble: 5 x 2
## BORO shootings
## <chr> <int>
## 1 BROOKLYN 6484
## 2 BRONX 4551
## 3 QUEENS 2389
## 4 MANHATTAN 1945
## 5 STATEN ISLAND 471
Second visualization: Murders and shootings by year for each borough
boro_group %>%
  # filter(BORO == "BRONX" | BORO == "BROOKLYN") %>%
  ggplot(aes(x = year, y = shootings, fill = BORO)) +
  geom_col() +
  theme(legend.position = "bottom") +
  labs(title = "Shootings by Borough", y = NULL, x = "Year")
Bias identification: At first I was very interested in seeing how race and age might play out in these shooting incidents. I then realized how fraught with bias both variables are: my own biases, the race identifications recorded in the data, and the very broad age groupings that were used.
To avoid these biases, both my own and those in the data, I looked only at how murders and shootings relate to time, either by month of the year or year over year. The exception is the analysis of the boroughs with the highest number of murders. One might conclude from it that Manhattan is a safer place, but it could instead be that most murders happen in the evenings and Manhattan has more businesses than residences. Finding out whether this is biasing the results would require further research and data.